In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5
plt.rcParams['axes.grid'] = True
plt.gray()
The first step is to load the training and test data that were saved when running the Training and test feature set generation notebook.
In [2]:
cd ../../features/
In [3]:
posdata = np.loadtxt("human.iRefIndex.positive.vectors.txt", delimiter="\t", dtype="str")
Training with the full negative set is impossible on this computer because the file cannot be loaded into RAM. To get around this, we can use a negative training set only 10 times larger than the positive training set:
In [4]:
negdata = np.loadtxt("human.iRefIndex.negative.vectors.txt", delimiter="\t", dtype="str")
There are some conflicting entries in this data, detected during the training and test set generation. These can be removed using their indexes:
In [5]:
X = np.concatenate((posdata,negdata))
In [6]:
# indexes of rows that contain no "missing" values
plottablerows = X == "missing"
plottablerows = np.where(plottablerows.max(axis=1) == False)
Unfortunately, some "NA" strings have sneaked in somehow, so we will have to zero these. This appears to be due to a feature in Gene_Ontology:
In [7]:
NAinds = np.where(X=="NA")
In [8]:
X[NAinds] = 0.0
In [9]:
X[X=="missing"] = np.nan
X = X.astype(np.float)
Finally we can create the target vector y from what we know about the lengths of the positive and negative sets:
In [10]:
#create the target y vector
y = np.array([1]*len(posdata) + [0]*len(negdata))
We would like the training set we use to reflect our prior estimate that only one in every 600 combinations of protein identifiers is a true interaction. To ensure that our training set reflects this, we need to sample from it so as to maintain this proportion of positive to negative examples.
An efficient way to do this is to create a modified y vector which is in the correct proportion and maps back to the original y vector. Then, after calling StratifiedShuffleSplit, the returned indices can be transformed back to the correct indices on the original y vector:
In [11]:
from random import shuffle

def propy(y, samplesize=20000):
    """Function to produce proportional samples of y from the full dataset.
    Takes as input: y - the vector of class labels, samplesize - size of sample.
    Outputs: mody - modified y vector, transformdict - dictionary mapping indices
    of mody back to indices of the original y."""
    enuy = list(enumerate(y))
    shuffle(enuy)
    # one positive for every 599 negatives
    nones = int(samplesize/600)
    nzeros = int((samplesize*599)/600)
    transformdict = {}
    mody = []
    # iterate over the shuffled list and pull out the required number of ones and zeros
    for i, label in enuy:
        if label == 1 and nones > 0:
            transformdict[len(mody)] = i
            mody.append(label)
            nones -= 1
        elif label == 0 and nzeros > 0:
            transformdict[len(mody)] = i
            mody.append(label)
            nzeros -= 1
        if nzeros == 0 and nones == 0:
            break
    return mody, transformdict
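As a minimal usage sketch of the mapping described above (illustrative only; the sample and split sizes here are arbitrary and this particular split is not used later in the notebook), the indices returned by StratifiedShuffleSplit on the proportional sample can be translated back to positions in the original y vector through transformdict:

import numpy as np
import sklearn.cross_validation

mody, tfdict = propy(y, samplesize=20000)
splits = sklearn.cross_validation.StratifiedShuffleSplit(mody, n_iter=1, test_size=0.25)
train, test = next(iter(splits))
# translate sample indices back to positions in the original y vector
origtrain = np.array([tfdict[i] for i in train])
origtest = np.array([tfdict[i] for i in test])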
Then all that's required is to create a sub-sampled, but still large, version of the X and y vectors:
In [12]:
mody,tfdict = propy(y,samplesize=400000)
In [13]:
propindexes = np.array(tfdict.values())
X, y = X[propindexes], y[propindexes]
In [14]:
len(X),len(y)
Out[14]:
Each classifier will be tested in the following ways: a learning curve to choose a workable training set size, a randomized grid search over its main parameters, ROC and precision-recall curves, and a comparison against a dummy classifier.
The importance of different features will also be tested by eliminating features and repeating the tests above.
Finally, as a sanity check, we will look at the model's predictions for some well-known protein interactions.
Logistic regression is a simple classifier which passes a linear transformation of the input variables through a sigmoid function to produce a probability.
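Concretely, in the standard formulation (not anything specific to this notebook), the predicted probability of an interaction for a feature vector $x$ is
$$ P(y=1 \mid x) = \sigma(w^{T}x + b) = \frac{1}{1 + e^{-(w^{T}x + b)}} $$
where the weight vector $w$ and bias $b$ are learned from the training data.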
In [15]:
cd tmp/
In [16]:
import sklearn.linear_model
In [17]:
import sys, os
sys.path.append(os.path.abspath("../../opencast-bio/"))
In [18]:
import ocbio.model_selection, ocbio.mmap_utils
reload(ocbio.model_selection)
reload(ocbio.mmap_utils)
Out[18]:
In [19]:
!ipcluster stop
In [20]:
!ipcluster start -n=10 --daemon
In [21]:
from IPython.parallel import *
In [22]:
client = Client(profile='default')
In [23]:
#initialise load balanced distributed processing
lb_view = client.load_balanced_view()
In [17]:
import sklearn.pipeline, sklearn.preprocessing
In [25]:
imputer = sklearn.preprocessing.Imputer(strategy='mean',missing_values="NaN")
scaler = sklearn.preprocessing.StandardScaler()
classifier = sklearn.linear_model.LogisticRegression()
model = sklearn.pipeline.Pipeline([('imp',imputer),('scl',scaler),('cls',classifier)])
We would like to plot a learning curve to get an idea of the training set size required, with k-fold validation, to get valid results. Until we decide on a final classifier it is preferable to be able to run fast iterations over different parameter settings than to absolutely maximise performance. A learning curve plots the performance of the algorithm on the training and test sets for different training set sizes.
As with the imported code, much of this code is taken from the excellent parallel machine learning tutorial by Olivier Grisel.
In [49]:
trainsizes = np.logspace(3, 5, 5)
#initialise and start run
lcsearch = ocbio.model_selection.LearningCurve(lb_view)
lcsearch.launch_for_arrays(model,X,y,trainsizes,n_cv_iter=5,params={'cls__C':0.316}, name="lrlc")
Out[49]:
In [53]:
print lcsearch
In [54]:
lrlc = lcsearch.plot_curve()
In [55]:
savez("../../plots/bayes/logistic.regression.learning.curve.npz",lrlc)
This shows that we can run the grid search with only 40,000 to 60,000 samples and still get meaningful results. When the grid search returns a best combination of parameters we will have to plot this learning curve again to make sure that the results are still valid. In the meantime, this will speed up our grid search considerably.
In [109]:
!ipcluster stop --profile=altcluster
In [110]:
!ipcluster start -n=10 --profile=altcluster --daemon
In [111]:
altclient = Client(profile='altcluster')
In [112]:
alb_view = altclient.load_balanced_view()
In [67]:
#make sure the processors aren't busy doing anything else:
alb_view.abort()
#initialise parameters to test:
lr_params = {
    'cls__C': np.logspace(-3, 1, 7)
}
In [68]:
#initialise mmaped data - had to increase CV iterations to reduce output variance
split_filenames = ocbio.mmap_utils.persist_cv_splits(X, y, test_size=10000, train_size=40000,
                                                     n_cv_iter=10, name='testfeatures', random_state=42)
In [69]:
#initialise and start run
lrsearch = ocbio.model_selection.RandomizedGridSearch(alb_view)
lrsearch.launch_for_splits(model,lr_params,split_filenames)
Out[69]:
In [76]:
print lrsearch.report()
With this data only one in every 18 examples will be positive, so the expected accuracy of a classifier that only predicts zeros is:
$$ 1 - \frac{1}{18} = \frac{17}{18} \approx 0.944 $$
So this model is outperforming the hypothetical all-zeros classifier.
We can implement this all zeros classifier and use it to create a comparative measure to make it easier to discriminate between classifiers.
In [62]:
import sklearn.dummy
In [63]:
def dummycompare(X, y, train, test, best_score):
    """Compare best_score against a classifier that always predicts the most frequent class."""
    dummy = sklearn.dummy.DummyClassifier(strategy="most_frequent")
    dummy.fit(imputer.fit_transform(X[train]), y[train])
    score = dummy.score(imputer.fit_transform(X[test]), y[test])
    if score > best_score:
        # the model failed to beat the dummy classifier
        return 0.0, 0.0
    else:
        # using a logit function here on a biased and scaled best score
        best_score = (best_score - score)*(1.0/(1.0 - score))
        return best_score, np.log(best_score/(1.0 - best_score))
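In other words, writing $a$ for the model's score and $a_{0}$ for the dummy classifier's score, the comparative measure rescales the improvement over the dummy onto the unit interval and then applies a logit:
$$ s = \frac{a - a_{0}}{1 - a_{0}}, \qquad \mathrm{logit}(s) = \ln\frac{s}{1 - s} $$
Larger values indicate a bigger improvement over simply predicting the majority class; zeros are returned if the model fails to beat the dummy.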
In [83]:
import sklearn.cross_validation

results = np.zeros([5, 2])
for i, (train, test) in enumerate(sklearn.cross_validation.StratifiedShuffleSplit(y, n_iter=5,
                                                                                  test_size=10000, train_size=40000)):
    results[i, :] = dummycompare(X, y, train, test, lrsearch.find_bests()[0][0])
In [84]:
print np.mean(results, axis=0)
A ROC curve indicates the trade-offs this classifier offers between the true positive rate and the false positive rate as the decision threshold is varied. In some applications the price of a high false positive rate is acceptable in order to obtain a high true positive rate. As the false positive rate approaches 1.0, though, the true positive rate inevitably approaches 1.0 for any classifier, because all samples are being classified as positive.
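For reference, the two rates are defined in the standard way:
$$ \mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN} $$
where $TP$, $FP$, $TN$ and $FN$ count the true/false positives and negatives at a given decision threshold.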
In [26]:
import sklearn.metrics

def plotroc(model, Xtest, ytest):
    """Plot the ROC curve for a fitted model on a test set and return the curve."""
    estimates = model.predict_proba(Xtest)
    fpr, tpr, thresholds = sklearn.metrics.roc_curve(ytest, estimates[:, 1])
    plt.clf()
    plt.plot(fpr, tpr)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic AUC: {0}'.format(
        sklearn.metrics.roc_auc_score(ytest, estimates[:, 1])))
    plt.show()
    return fpr, tpr
In [30]:
train, test = next(iter(sklearn.cross_validation.StratifiedShuffleSplit(y, test_size=0.25)))
In [89]:
bestparams = lrsearch.find_bests()[0][-1]
In [90]:
%%time
model.set_params(**bestparams)
model.fit(X[train],y[train])
fpr,tpr = plotroc(model,X[test],y[test])
We can see that the AUC score is greater than 0.5, which means that the classifier is performing better than chance.
In [91]:
savez("../../plots/bayes/logistic.regression.roc.npz",fpr,tpr)
In [92]:
print dummycompare(X,y,train,test,model.score(X[test],y[test]))
Precision-recall curves plot precision against recall as the decision threshold is varied.
The functions to compute them are implemented in Scikit-learn.
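For reference, the standard definitions are:
$$ \mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN} $$
so precision measures what fraction of predicted interactions are real, while recall measures what fraction of the real interactions are recovered.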
In [21]:
def drawprecisionrecall(X, y, test, model):
    try:
        yscore = model.decision_function(X[test])
    except AttributeError:
        yscore = model.predict_proba(X[test])[:, 1]
    precision, recall, _ = sklearn.metrics.precision_recall_curve(y[test], yscore)
    average_precision = sklearn.metrics.average_precision_score(y[test], yscore,
                                                                average="micro")
    plt.clf()
    plt.plot(recall, precision)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall curve, AUC: {0:0.2f}'.format(average_precision))
    plt.show()
    return precision, recall
In [99]:
precision,recall = drawprecisionrecall(X,y,test,model)
This shows that it is not possible, even at low precision, to recall all of the positive training examples. Unfortunately, less than 10% can be recalled with perfect precision.
In [100]:
savez("../../plots/bayes/lr_precisionrecall.npz",precision,recall)
In [101]:
plt.plot(model.named_steps['cls'].coef_.T)
Out[101]:
In [102]:
savez("../../plots/bayes/logistic.regression.coef.npz",model.named_steps['cls'].coef_.T)
Which feature is the largest peak in the above plot?
In [103]:
np.where(model.named_steps['cls'].coef_.T == np.amax(model.named_steps['cls'].coef_.T))
Out[103]:
The largest peak in the above corresponds to a Gene Ontology feature. However, many of the features are almost as strongly weighted, showing that the classifier is not relying on a single feature.
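As a purely hypothetical sketch (the list featurenames below is not built anywhere in this notebook; it stands in for the column labels of X, in order), the most heavily weighted features could be listed like this:

import numpy as np
# Hypothetical: featurenames is assumed to hold one label per column of X.
coefs = model.named_steps['cls'].coef_.ravel()
order = np.argsort(np.abs(coefs))[::-1]
for idx in order[:10]:
    print featurenames[idx], coefs[idx]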
A support vector machine is a kernel-based classifier which fits a hyperplane to separate the training examples into different classes, maximising the distance from the hyperplane to the nearest training points in the high-dimensional space induced by the kernel.
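In its kernelised form (the standard formulation, not specific to this analysis) the decision function for a new example $x$ is
$$ f(x) = \operatorname{sign}\left(\sum_{i} \alpha_{i} y_{i} K(x_{i}, x) + b\right) $$
where the sum runs over the support vectors $x_{i}$, $K$ is the kernel (e.g. RBF, polynomial or sigmoid, as tried in the grid search below) and the coefficients $\alpha_{i}$ are learned during training.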
Optimal performance of an SVM with the default parameters depends on the scaling of the data, so we build a pipeline for the model as before with logistic regression:
In [102]:
import sklearn.svm
In [103]:
imputer = sklearn.preprocessing.Imputer(strategy='mean',missing_values="NaN")
scaler = sklearn.preprocessing.StandardScaler()
classifier = sklearn.svm.SVC()
svmodel = sklearn.pipeline.Pipeline([('imp',imputer),('scl',scaler),('cls',classifier)])
In [124]:
trainsizes = np.logspace(3, 4.5, 5)
#initialise and start run
lcsearch = ocbio.model_selection.LearningCurve(lb_view)
lcsearch.launch_for_arrays(svmodel,X,y,trainsizes,n_cv_iter=3,params={'cls__C':1.0}, name="lrlc")
Out[124]:
In [127]:
print lcsearch
In [126]:
svmlc = lcsearch.plot_curve()
In [108]:
savez("../../plots/bayes/svmlc.npz",svmlc)
From this curve we can conclude that at least 20,000 training samples are going to be necessary for meaningful results.
In [128]:
#initialise parameters to test:
svm_params = {
    'cls__C': np.logspace(-1, 1, 3),
    'cls__kernel': ['rbf', 'poly', 'sigmoid'],
    'cls__gamma': np.logspace(-1, 1, 3)
}
In [129]:
#initialise mmaped data
split_filenames = ocbio.mmap_utils.persist_cv_splits(X, y, test_size=2000, train_size=10000,
                                                     n_cv_iter=3, name='testfeatures', random_state=42)
In [117]:
#initialise and start run
search = ocbio.model_selection.RandomizedGridSearch(alb_view)
search.launch_for_splits(svmodel,svm_params,split_filenames)
Out[117]:
In [130]:
print search
This is taking an extremely long time to train and is not feasible at the sample sizes above. To quote Murphy's textbook on SVMs:
SVMs also take $O(N^{3})$ time to train...
This point is also made in the documentation for the Scikit-learn SVM classifier.
Because of this, SVMs do not seem well suited to this data, where we will primarily be working with very large sample sizes.
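To put that scaling in perspective (rough arithmetic, ignoring constant factors): going from the 10,000 training samples used in the grid search above to the 100,000 samples used elsewhere in this notebook would multiply the training time by roughly
$$ \left(\frac{100000}{10000}\right)^{3} = 10^{3} = 1000. $$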
In [131]:
train, test = next(iter(sklearn.cross_validation.StratifiedShuffleSplit(y, test_size=2000, train_size=10000)))
print len(train), len(test)
In [132]:
svmparams = search.find_bests()[0][-1]
svmparams['cls__probability'] = True
In [133]:
%%time
svmodel.set_params(**svmparams)
svmodel.fit(X[train],y[train])
fpr,tpr = plotroc(svmodel,X[test],y[test])
That is an extremely poor ROC curve, probably because of the very small training set. It is unavoidable, though, given the time it takes to train support vector machines.
In [136]:
cd tmp
In [137]:
savez("../../plots/bayes/svm.roc.npz",fpr,tpr)
In [138]:
svmprecision,svmrecall = drawprecisionrecall(X,y,test,svmodel)
In [139]:
savez("../../plots/bayes/svm.precisionrecall.npz",svmprecision,svmrecall)
Random forest classifiers have proven to be effective in the task of protein interaction prediction in the past. A random forest classifier combines the results of many decision tree classifiers. In this way, it can capture correlations in the data that simpler methods, such as logistic regression, cannot.
The combination of decision trees is implemented in Scikit-learn by averaging the results of the individual trees.
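As a minimal sketch on toy data (not the notebook's feature vectors), the forest's predicted class probabilities are simply the mean of the individual trees' predicted probabilities:

import numpy as np
import sklearn.ensemble

# toy data purely to illustrate the averaging
rng = np.random.RandomState(0)
Xtoy = rng.rand(200, 5)
ytoy = (Xtoy[:, 0] + Xtoy[:, 1] > 1.0).astype(int)

forest = sklearn.ensemble.RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(Xtoy, ytoy)

# average the per-tree class probabilities and compare with the forest's output
treemean = np.mean([tree.predict_proba(Xtoy) for tree in forest.estimators_], axis=0)
print np.allclose(treemean, forest.predict_proba(Xtoy))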
As above we will define the pipeline with the same scaling and imputation:
In [15]:
import sklearn.ensemble
In [18]:
imputer = sklearn.preprocessing.Imputer(strategy='mean',missing_values="NaN")
scaler = sklearn.preprocessing.StandardScaler()
classifier = sklearn.ensemble.RandomForestClassifier()
rfmodel = sklearn.pipeline.Pipeline([('imp',imputer),('scl',scaler),('cls',classifier)])
In [125]:
trainsizes = np.logspace(3, 5, 7)
#initialise and start run
rflcsearch = ocbio.model_selection.LearningCurve(alb_view)
rflcsearch.launch_for_arrays(rfmodel, X, y, trainsizes, n_cv_iter=3,
                             params={'cls__n_estimators':100, 'cls__bootstrap':False},
                             name="lrlc")
Out[125]:
In [135]:
print rflcsearch
In [136]:
rflc = rflcsearch.plot_curve()
Training a single model with 100,000 samples to see how long each of these runs is going to take:
In [137]:
savez("../../plots/bayes/random.forest.learning.curve.npz",rflc)
The major parameters to vary with a Random Forest classifier are:
- n_estimators - the number of trees in the forest
- max_features - the size of the random subsets of features to consider when splitting a node
According to the documentation, a good value for max_features is the square root of the number of features, in our case approximately 15.
As for the number of trees in the forest, more trees give better performance, but the model takes longer to train.
We will try up to a maximum of 200 estimators to see at which point there are no longer significant gains from increasing the number of estimators.
Also, the random trees use bootstrapping by default when training, which is unnecessary on a dataset as large as this.
Therefore, we should set bootstrap=False.
In [36]:
split_filenames = ocbio.mmap_utils.persist_cv_splits(X, y, test_size=20000, train_size=100000,
                                                     n_cv_iter=3, name='rffeatures', random_state=42)
In [37]:
rfparams = {'cls__n_estimators':100,'cls__bootstrap':False}
In [53]:
%%time
train, test = next(iter(sklearn.cross_validation.StratifiedShuffleSplit(y, test_size=20000, train_size=100000)))
rfmodel.set_params(**rfparams)
rfmodel.fit(X[train],y[train])
print "Training score: {0}".format(rfmodel.score(X[train],y[train]))
print "Test score: {0}".format(rfmodel.score(X[test],y[test]))
In [38]:
rf_params = {
    'cls__n_estimators': np.logspace(1, 2.3, 5).astype(np.int),
    'cls__max_features': np.arange(5, 30, 5).astype(np.int),
    'cls__bootstrap': [False]
}
In [39]:
rfsearch = ocbio.model_selection.RandomizedGridSearch(lb_view)
rfsearch.launch_for_splits(rfmodel,rf_params,split_filenames)
Out[39]:
In [51]:
print rfsearch
Training the model we would like to use on a large training set to get the best possible performance:
In [52]:
rfparams = rfsearch.find_bests()[0][-1]
In [20]:
%%time
train, test = next(iter(sklearn.cross_validation.StratifiedShuffleSplit(y, test_size=20000, train_size=200000)))
rfmodel.set_params(**rfparams)
rfmodel.fit(X[train],y[train])
print "Training score: {0}".format(rfmodel.score(X[train],y[train]))
print "Test score: {0}".format(rfmodel.score(X[test],y[test]))
In [70]:
rfroc = plotroc(rfmodel,X[test],y[test])
In [71]:
savez("../../plots/bayes/random.forest.roc.npz",rfroc)
In [22]:
rfprec,rfrecall = drawprecisionrecall(X,y,test,rfmodel)
In [26]:
!git annex unlock ../../plots/bayes/random.forest.precisionrecall.npz
In [27]:
savez("../../plots/bayes/random.forest.precisionrecall.npz",rfprec,rfrecall)
Finally, we can compare accuracy to the dummy case:
In [75]:
dummycompare(X,y,train,test,rfsearch.find_bests()[0][0])
Out[75]:
Scikit-learn also provides an alternative to the random forest in which the thresholds for candidate features are chosen at random rather than optimised. This reduces the variance of the model at the cost of an increase in bias. These are called Extremely Randomized Trees.
In [48]:
imputer = sklearn.preprocessing.Imputer(strategy='mean',missing_values="NaN")
scaler = sklearn.preprocessing.StandardScaler()
classifier = sklearn.ensemble.ExtraTreesClassifier()
erfmodel = sklearn.pipeline.Pipeline([('imp',imputer),('scl',scaler),('cls',classifier)])
In [49]:
extrasearch = ocbio.model_selection.RandomizedGridSearch(alb_view)
extrasearch.launch_for_splits(erfmodel,rf_params,split_filenames)
Out[49]:
In [50]:
print extrasearch
The performance does not appear to be much better than the random forest. For this reason, it is probably not worth using this classifier over a simple random forest.
In [76]:
importances = rfmodel.named_steps['cls'].feature_importances_
In [77]:
plt.plot(importances)
Out[77]:
In [78]:
savez("../../plots/bayes/random.forest.importances.npz",importances)
In [80]:
cd ..
In [82]:
import pickle
In [83]:
f = open("random.forest.bayes.classifier.pickle","wb")
pickle.dump(rfmodel,f)
f.close()
In [85]:
edgelist = np.loadtxt("human.activezone.txt", dtype=str)
NAinds = np.where(edgelist == "NA")
edgelist[NAinds] = 0.0
edgelist[edgelist == "missing"] = np.nan
edgelist = edgelist.astype(np.float)
In [86]:
estimates = rfmodel.predict_proba(edgelist)
In [87]:
print estimates[5,1]
In [90]:
import csv
In [91]:
f, fw = open("../forGAVIN/mergecode/OUT/edgelist.txt"), open("../forGAVIN/mergecode/OUT/edgelist_weighted.txt", "w")
c, cw = csv.reader(f, delimiter="\t"), csv.writer(fw, delimiter="\t")
# skip the header line
c.next()
for i, l in enumerate(c):
    cw.writerow(l + [estimates[i, 1]])
f.close(), fw.close()
Out[91]:
In [93]:
oneindexes = np.where(y == 1)
zeroindexes = np.where(y == 0)
In [94]:
onesamples = rfmodel.predict_proba(X[oneindexes])
In [100]:
zerosamples = rfmodel.predict_proba(X[zeroindexes])
In [101]:
f = open("random.forest.bayes.dist.samples.pickle","wb")
pickle.dump({1:onesamples,0:zerosamples},f)
f.close()